fix(client): self-healing for permanently stuck expired shape handles by KyleAMathews · Pull Request #4087 · electric-sql/electric

KyleAMathews · 2026-04-03T15:11:20Z

Summary

Expired shape handle entries in localStorage can get permanently stuck, preventing data from ever loading for affected shapes. This adds a self-healing retry mechanism that clears the poisoned entry and retries once, allowing automatic recovery even when a proxy strips cache-buster query parameters.

Based on #4085 by @evan-liveflow — refined with additional hardening from code review.

Root Cause

When a shape gets a 409 (handle rotation), the client stores the old handle in localStorage['electric_expired_shapes']. On future requests, if a response contains that handle, the client treats it as a stale cached response and retries up to 3 times with cache-buster params.

The problem: if a proxy (e.g., phoenix_sync) strips query parameters, the cache busters are ineffective. All 3 retries fail, FetchError(502) is thrown to onError, and if onError doesn't retry, the stream dies. The expired entry persists in localStorage, so the next session hits the same wall — permanently.

Since the server never reuses handles (now documented as SPEC.md S0), the expired entry becomes a false positive once the caching layer clears — but the client has no way to discover this.

Approach

After stale cache retries exhaust (3 attempts), the client now:

Always clears the expired entry from localStorage — if cache busters didn't work, keeping the entry only poisons future sessions
Attempts one self-healing retry — resets the stream and retries without the expired_handle param. Since handles are never reused, the fresh response will have a new handle and won't trigger stale detection
Guards against infinite loops via #expiredShapeRecoveryKey (once per shape key, reset on up-to-date)

if (transition.exceededMaxRetries) {
  if (shapeKey) {
    expiredShapesCache.delete(shapeKey)       // always clear
    if (this.#expiredShapeRecoveryKey !== shapeKey) {
      this.#expiredShapeRecoveryKey = shapeKey // remember we tried
      this.#reset()                            // fresh start
      throw new StaleCacheError(...)           // caught internally → retry
    }
  }
  throw new FetchError(502, ...)               // truly give up
}

Key Invariants

S0: Server handles are unique and never reused (phash2 + microsecond timestamp, SQLite UNIQUE INDEX, ETS insert_new)
Self-healing fires at most once per shape per retry cycle (#expiredShapeRecoveryKey guard)
Guard resets on up-to-date, so long-lived streams can self-heal again if CDN misbehaves later
Expired entry is cleared on every exhaustion, regardless of whether self-healing fires

Non-goals

TTL on expired cache entries — the self-healing mechanism handles the failure mode without added complexity
Changing onError contract — the fix works regardless of what the user's onError callback does

Verification

cd packages/typescript-client
pnpm vitest run --config vitest.unit.config.ts
# 312 tests pass
pnpm exec tsc --noEmit
# Clean

Files changed

File	Change
`src/client.ts`	Self-healing logic in `#onInitialResponse`, recovery key cleared on up-to-date, updated catch block comment
`test/expired-shapes-cache.test.ts`	Updated 2 existing tests for self-healing flow, added test for CDN-always-stale scenario
`SPEC.md`	Added S0 (handle uniqueness guarantee), updated L3 loop-back entry and guard table
`.changeset/fix-expired-shapes-self-healing.md`	Changeset for patch release

Based on #4085

When stale cache retries exhaust (3 attempts), clear the expired entry from localStorage and retry once without the expired_handle param. Since handles are never reused (SPEC.md S0), the fresh response gets a new handle and bypasses stale detection. This prevents shapes from being permanently unloadable when a proxy strips cache-buster query params. Also documents the server handle uniqueness guarantee (S0) in the spec, updates the loop-back table for the new self-healing path, and resets the recovery guard on up-to-date so self-healing remains available for long-lived streams. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

netlify · 2026-04-03T15:12:39Z

✅ Deploy Preview for electric-next ready!

Name	Link
🔨 Latest commit	`19efb39`
🔍 Latest deploy log	https://app.netlify.com/projects/electric-next/deploys/69d92c95e16c9600087983a7
😎 Deploy Preview	https://deploy-preview-4087--electric-next.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

pkg-pr-new · 2026-04-03T15:12:47Z

Open in StackBlitz

npm i https://pkg.pr.new/@electric-sql/react@4087

npm i https://pkg.pr.new/@electric-sql/client@4087

npm i https://pkg.pr.new/@electric-sql/y-electric@4087

commit: 424fe62

codecov · 2026-04-03T15:14:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.78%. Comparing base (b449f70) to head (424fe62).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4087   +/-   ##
=======================================
  Coverage   88.78%   88.78%           
=======================================
  Files          25       25           
  Lines        2452     2471   +19     
  Branches      615      627   +12     
=======================================
+ Hits         2177     2194   +17     
- Misses        273      275    +2     
  Partials        2        2

Flag	Coverage Δ
packages/experimental	`87.73% <ø> (ø)`
packages/react-hooks	`86.48% <ø> (ø)`
packages/start	`82.83% <ø> (ø)`
packages/typescript-client	`93.86% <100.00%> (-0.05%)`	⬇️
packages/y-electric	`56.05% <ø> (ø)`
typescript	`88.78% <100.00%> (+<0.01%)`	⬆️
unit-tests	`88.78% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…test The test waited for fast-loop detection to error, but the exponential backoff (100ms-5s across 5 detections) takes longer than the timeout in CI. Simplified to verify self-healing fires and the entry is cleared — the fast-loop error path is already tested in stream.test.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The #expiredShapeRecoveryKey guard was only cleared in #onMessages when an up-to-date batch arrived. The 204 backward-compatibility path transitions directly to LiveState without going through #onMessages (empty body → batch.length === 0 → early return), leaving the guard stuck. This prevented a second self-healing cycle on the same stream instance. Clear the guard in #onInitialResponse when the response transitions directly to live (action=accepted, state=live), covering the 204 path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…test The caughtError===null assertion was environment-sensitive: the fast-loop detector's 500ms window can catch more requests on slower machines, firing a 502 that's orthogonal to the recovery guard bug being tested. The precise signal is selfHealCount===2: if the guard is stuck, the code throws 502 *before* incrementing, so selfHealCount stays at 1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

balegas · 2026-04-10T14:59:16Z

Code Review

Overview

The core fix is correct. I traced the stale-retry → self-healing flow end-to-end and verified body cancellation, guard clearing race-freedom, and reset ordering all hold. SPEC.md S0 is the right place to anchor the handle-uniqueness assumption the mechanism depends on.

Suggestions

1. The 204 guard-clearing check is looser than the comment suggests (client.ts:1244-1246)

if (transition.action === `accepted` && this.#syncState.kind === `live`) {
  this.#expiredShapeRecoveryKey = null
}

The comment says "e.g., 204" but the check fires whenever accepted + state becomes live. It works today only because the state machine doesn't transition to live at response-metadata stage for 200s — if that assumption ever changes, this will start clearing the guard prematurely. Consider either:

Narrowing to an explicit status === 204 check (matches the comment's intent), or
Adding a state-machine assertion/test to lock in the assumption

Not blocking — defensive hardening.

2. #reset() comment is now inaccurate (client.ts:1770)

The comment on #reset() says it's called from within the running stream loop (#requestShape's 409 handler) — self-healing is now a second caller. A future edit that relies on "only 409 handler" semantics could break self-healing. One-line fix:

it's called from within the running stream loop (#requestShape's 409 handler
or stale-retry self-healing)

3. Self-healing silently accepts stale data

The "CDN always returns stale handle" case documents that after self-healing clears the expired entry, the client happily accepts the same stale response. The trade-off (stale data > permanent 502) is the right call, but there's no post-hoc signal to the user that they're on stale data. Consider logging a follow-up warning when a post-self-healing response arrives with the handle that was just marked-and-cleared. Low priority — could be a separate PR if telemetry becomes a need.

Test concerns

1. Timing-sensitive setTimeout waits (200ms, 500ms, 1000ms)

The test history (1f7a5f4, 3c90f21) shows the author already fought flakiness here. Consider adopting the deterministic FetchGate pattern from #4089 once that lands — turn-based gating eliminates all timing races. The two PRs are complementary.

2. The 204-path test is doing too much (should clear recovery guard after 204 so self-healing works again)

105 lines of mock logic with multi-phase state (selfHealCount, healingDone, mid-test markExpired, manual aborter.abort()), and asserts only selfHealCount === 2 because the fast-loop detector may fire on slower machines. That's a signal the test is conflating concerns. Consider splitting:

Unit test: "204 clears recovery guard" against #onInitialResponse directly
Integration test: "Self-healing can fire multiple times after recovery guard clears"

3. markExpired mutation inside fetchMock.mockImplementation (test/expired-shapes-cache.test.ts:418)

Mutating expiredShapesCache from inside the fetch mock couples the test to an event ordering that's not obvious to future readers. At minimum, a // Simulate a second 409 happening mid-stream comment; ideally, restructure so the mutation happens outside the mock callback.

Recommendation

Approve with minor suggestions. Fix #1 (tighten the 204 check) and #2 (comment update) are cheap safety wins worth doing before merge. The rest are non-blocking.

Addresses three findings from external review of the self-healing PR: 1. Detect and warn when a post-self-heal response carries the same handle we just marked expired. Previously the client silently accepted stale data with no operator signal — now it emits a targeted warning naming the handle and pointing at the proxy cache-key misconfiguration that causes this. 2. Tighten the recovery-guard clearing check from `accepted + kind === live` to an explicit `status === 204`, matching the comment's intent and removing latent fragility if the state machine ever starts transitioning to live for non-204 responses. 3. Update the `#reset()` comment to list all three callers (#requestShape's 409 handler, #checkFastLoop, and stale-retry self-healing) instead of only the 409 handler. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Two orthogonal CI fixes for the expired-shapes-cache test suite: 1. Silence stderr noise: many tests in this suite intentionally trigger stale-cache scenarios that produce expected console.warn output. Install a shared beforeEach spy that mocks console.warn for all tests (tests that need to assert on warnings still can, via warnSpy.mock.calls). 2. Prevent unhandled FetchError(502) from fast-loop detector: - "CDN always returns stale handle" test: add onError handler and poll until self-heal fires, then abort explicitly so the stream can't loop in the background until #checkFastLoop throws. - "204 recovery guard" test: add onError handler so the fast-loop detector (which the test's own comment acknowledges may race on slow runners) can't leak as an unhandled rejection. Both tests still assert the same signals — this change affects test plumbing only. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Package-local prettier 3.3.3 and root prettier 3.6.2 format markdown blank-lines-before-lists differently. CI runs root prettier (strips the line); the pre-commit hook picks up package-local prettier (re-adds it), causing lint-staged to think the commit is empty when trying to fix CI. Changesets auto-generates CHANGELOG files, so formatting them manually isn't needed anyway. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Package-local prettier 3.3.3 formats markdown blank-lines-before-lists differently than root prettier 3.6.2. CI runs root prettier; the pre-commit hook picks up package-local prettier — they disagreed, causing lint-staged to think every changelog fix was an empty commit. - Bumped prettier to ^3.6.2 in experimental, react-hooks, start, typescript-client, y-electric, burn/assets, and redis packages (aligning with root 3.6.2; identified via sherif). - Reverted the temporary CHANGELOG.md entry in .prettierignore now that the root cause is fixed. - Reformatted both package CHANGELOG files with 3.6.2 to match CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Using bare "prettier --write" in lint-staged lets nano-spawn's PATH resolution pick up globally-installed prettier (e.g. ~/.bun/bin/prettier 3.4.2) over the workspace's 3.6.2. That global version disagrees with CI on markdown list formatting, causing the pre-commit hook to silently revert the fix and then fail with "empty commit". Pointing explicitly at node_modules/.bin/prettier guarantees lint-staged uses exactly the version pnpm installed. Also reformats the two CHANGELOGs that CI's format:check flagged.

github-actions · 2026-04-10T20:01:43Z

This PR has been released! 🚀

The following packages include changes from this PR:

@electric-sql/client@1.5.15

Thanks for contributing to Electric!

evanob and others added 2 commits April 2, 2026 20:22

Allow expired cache shapes to attempt self-healing once

16f9ccf

KyleAMathews and others added 3 commits April 3, 2026 09:16

evanob mentioned this pull request Apr 8, 2026

Allow expired cache shapes to attempt self-healing once after maxRetries exceeded #4085

Closed

KyleAMathews and others added 6 commits April 10, 2026 10:29

Merge remote-tracking branch 'origin/main' into pr-4085

19efb39

msfstef approved these changes Apr 10, 2026

View reviewed changes

KyleAMathews merged commit 690e25a into main Apr 10, 2026
43 checks passed

KyleAMathews deleted the pr-4085 branch April 10, 2026 18:22

KyleAMathews temporarily deployed to Pull request April 10, 2026 18:22 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(client): self-healing for permanently stuck expired shape handles#4087

fix(client): self-healing for permanently stuck expired shape handles#4087
KyleAMathews merged 11 commits intomainfrom
pr-4085

KyleAMathews commented Apr 3, 2026

Uh oh!

netlify bot commented Apr 3, 2026 •

edited

Loading

Uh oh!

pkg-pr-new bot commented Apr 3, 2026 •

edited

Loading

Uh oh!

codecov bot commented Apr 3, 2026 •

edited

Loading

Uh oh!

balegas commented Apr 10, 2026

Uh oh!

Uh oh!

github-actions bot commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

KyleAMathews commented Apr 3, 2026

Summary

Root Cause

Approach

Key Invariants

Non-goals

Verification

Files changed

Uh oh!

netlify bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for electric-next ready!

Uh oh!

pkg-pr-new bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

balegas commented Apr 10, 2026

Code Review

Overview

Suggestions

Test concerns

Recommendation

Uh oh!

Uh oh!

github-actions bot commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

netlify bot commented Apr 3, 2026 •

edited

Loading

pkg-pr-new bot commented Apr 3, 2026 •

edited

Loading

codecov bot commented Apr 3, 2026 •

edited

Loading